Survey: Transformer based video-language pre-training
Authors
Abstract
Inspired by the success of transformer-based pre-training methods on natural language tasks and, more recently, on computer vision tasks, researchers have begun to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical pre-training & fine-tuning paradigm for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used datasets. Next, we categorize models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performances. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.
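Since the abstract names the attention mechanism and position encoding as the background components of the transformer, a minimal sketch of both may help. The NumPy implementation below is illustrative only, not code from the survey or the surveyed models; the function names, toy shapes, and the video-tokens-attending-to-text-tokens example are all assumptions.

```python
# Minimal NumPy sketch of two transformer components named in the abstract.
# Illustrative only: shapes, names, and the toy example are assumptions,
# not code from the surveyed models.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, len_q, len_k)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (batch, len_q, d_v)

def sinusoidal_position_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...).
    Assumes an even d_model."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# Toy cross-modal usage: 2 video-frame tokens attending over 4 text tokens.
Q = np.random.randn(1, 2, 8)                           # queries from video
K = np.random.randn(1, 4, 8)                           # keys from text
V = np.random.randn(1, 4, 8)                           # values from text
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                                       # (1, 2, 8)
print(sinusoidal_position_encoding(4, 8).shape)        # (4, 8)
```

In Multi-Stream video-language models, this same operation is typically what performs cross-modal fusion, with queries drawn from one modality and keys/values from the other.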
Similar resources
The effect of lexically based language teaching (LBLT) on vocabulary learning among Iranian pre-university students
The aim of the present study is to investigate the effect of the lexically based (word-centered) teaching method on vocabulary learning among pre-university students. To this end, two groups of pre-university students (sixty in total), who were studying in Noorabad, Lorestan province, during the 1389 (2010-11) academic year, were selected and assigned by convention as the experimental and control groups. First, in order to ensure that the two groups were homogeneous in vocabulary knowledge, a ...
Content-Based Pre-Indexed Video
The viability of large distributed image databases is strongly dependent on the development of new image representations capable of providing support for extended functionality, directly in the compressed domain. We have recently introduced one such representation (Library-based coding) which we now augment with statistical pre-indexing schemes, automatically built at the time of encoding, tha...
Video survey of pre-grasp interactions in natural hand activities
• Objects are often movable in the environment and do not have to be grasped from the presented placement.
• Pre-grasp interaction can adjust object configuration in the environment to improve the task conditions for the final grasp.
• Our video observation surveys the variety of pre-grasp interactions used by people in natural task settings.
• The observed pre-grasp interactions can be described by...
Language Generation with Recurrent Generative Adversarial Networks without Pre-training
Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for text generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent neural networks. Consequently, past work has either resorted to pre-training with maximum likelihood or used convolutional networks for generation. In this work, ...
Feed forward pre-training for recurrent neural network language models
The recurrent neural network language model (RNNLM) has been demonstrated to consistently reduce perplexities and automatic speech recognition (ASR) word error rates across a variety of domains. In this paper we propose a pre-training method for the RNNLM, by sharing the output weights of the feed forward neural network language model (NNLM) with the RNNLM. This is accomplished by first fine-tu...
Journal
Journal title: AI Open
Year: 2022
ISSN: 2666-6510
DOI: https://doi.org/10.1016/j.aiopen.2022.01.001